I’m going to do a very simple analysis of Baltimore crime to show off R. We’ll use data downloaded from Baltimore City’s awesome open data site (this was downloaded a couple of years ago so if you download now, you will get different results).
Let’s load the data:
arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
cctv_tab=read.csv("CCTV_Locations.csv", stringsAsFactors=FALSE)
# these columns are mislabeled, so fix them
tmp=arrest_tab$sex
arrest_tab$sex=arrest_tab$race
arrest_tab$race=tmp
# dimension of table (data.frame)
dim(arrest_tab)
## [1] 104528 15
# what are the columns
names(arrest_tab)
## [1] "arrest" "age" "sex"
## [4] "race" "arrestDate" "arrestTime"
## [7] "arrestLocation" "incidentOffense" "incidentLocation"
## [10] "charge" "chargeDescription" "district"
## [13] "post" "neighborhood" "Location.1"
# what is the average arrest age?
mean(arrest_tab$age)
## [1] 33.19639
# the range of arrest ages
range(arrest_tab$age)
## [1] 0 87
# how many arrests per sex
table(arrest_tab$sex)
##
##            F     M
##      2 19431 85095
# what are the most common offenses
head(sort(table(arrest_tab$incidentOffense),decreasing=TRUE))
##
## Unknown Offense 87-Narcotics 4E-Common Assault
## 38649 24744 6739
## 87O-Narcotics (Outside) 97-Search & Seizure 79-Other
## 6515 3670 3461
# what are the offenses that only happen once
tab <- table(arrest_tab$incidentOffense)
tab[tab == 1]
##
## 13-Assist Officer 2C-Carnal Knowledge
## 1 1
## 3EF-Robb Gas Station-Firearm 3EK-Robb Gas Station-Knife
## 1 1
## 62-Person Lying On St. 64-Drug Free Zone
## 1 1
## 6K-Larceny- Park. Meter 6L-Larceny- From Locker
## 1 1
## 70-Sanitation Complaint 8FO-Arson Public Str-Occ
## 1 1
## 8I-Arson Other Mobile 94-Abduction By Parent
## 1 1
# range of arrests after removing those w/ age==0
range(arrest_tab$age[arrest_tab$age>0])
## [1] 8 87
Offenses by sex
tab <- table(arrest_tab$incidentOffense, arrest_tab$sex)
Let’s see a table of arrests by sex and race
table(sex=arrest_tab$sex,race=arrest_tab$race)
## race
## sex         A     B H   I    U     W
##         2   0     0 0   0    0     0
##   F     0  37 14663 0  34  183  4514
##   M     0 205 72605 1 184 1566 10534
A histogram of age
hist(arrest_tab$age,nc=100)
with(arrest_tab,hist(age[sex=="M"],nc=100)) # males only
with(arrest_tab,hist(age[sex=="F"],nc=100)) # females only
Let’s take a look at how age depends on sex. Let’s plot age as a function of sex first (notice how we indicate that sex is a factor).
plot(arrest_tab$age~factor(arrest_tab$sex))
One of the neat things about R is that statistical model building and testing is built-in. The model we use is \(y_i=\beta_0+\beta_1 x_i\) where \(y_i\) is age of sample (example) \(i\) and \(x_i\) is an indicator variable \(x_i \in \{0,1\}\) with \(x_i=1\) if the \(i\)-th record (example) is male. You can check that \(\beta_1\) is the difference in mean age between females and males. We use the formula syntax to build a linear regression model.
# let's ignore those records with missing sex
fit=lm(age~factor(sex),data=arrest_tab,subset=arrest_tab$sex %in% c("M","F"))
summary(fit)
##
## Call:
## lm(formula = age ~ factor(sex), data = arrest_tab, subset = arrest_tab$sex %in%
## c("M", "F"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.388 -10.153 -3.153 9.612 53.847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.38778 0.08472 394.117 <2e-16 ***
## factor(sex)M -0.23432 0.09389 -2.496 0.0126 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.81 on 104524 degrees of freedom
## Multiple R-squared: 5.958e-05, Adjusted R-squared: 5.002e-05
## F-statistic: 6.228 on 1 and 104524 DF, p-value: 0.01257
We see that \(\beta_1 \approx -0.23\), meaning that the average arrest age for males is about 0.23 years (roughly 3 months) younger than for females. So there is very little difference in the average age (which is what the linear model is testing), but the probability of observing this difference in a sample of this size, when there is no true difference in average age, is small (\(p \approx 0.01\)). Since we have a very large number of examples, or records, this testing framework will declare even very small differences statistically significant. We’ll return to this theme later in class.
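Because sample size drives the p-value, it helps to look at a standardized effect size alongside it. A rough sketch (assuming arrest_tab is loaded as above; this is Cohen's d with a pooled standard deviation):

```r
# hedged sketch: effect size to complement the p-value
ages_f <- arrest_tab$age[arrest_tab$sex == "F"]
ages_m <- arrest_tab$age[arrest_tab$sex == "M"]
diff_means <- mean(ages_m) - mean(ages_f)   # about -0.23 years
pooled_sd <- sqrt(((length(ages_f) - 1) * var(ages_f) +
                   (length(ages_m) - 1) * var(ages_m)) /
                  (length(ages_f) + length(ages_m) - 2))
diff_means / pooled_sd   # Cohen's d: a very small effect despite p < 0.05
```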
First we need to extract latitude and longitude from the Location.1 field; we’ll use some string functions to do this
tmp=gsub("\\)","",gsub("\\(","",arrest_tab$Location.1))
tmp=strsplit(tmp,split=",")
arrest_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
arrest_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
Now let’s plot
plot(arrest_tab$lon, arrest_tab$lat, xlab="Longitude", ylab="Latitude", main="Arrests in Baltimore")
We can also use density estimates to make this nicer:
smoothScatter(arrest_tab$lon, arrest_tab$lat, xlab="Longitude", ylab="Latitude", main="Arrests in Baltimore")
Let’s make this fancier using the ggplot2 graphics system and the maps package, which contains map data.
library(maps)
library(ggplot2)
balto_map = subset(map_data("county", region="maryland"),subregion=="baltimore city")
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=arrest_tab,aes(x=lon,y=lat),color="blue",alpha=.1)
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).
Now let’s add CCTV cameras.
tmp=gsub("\\)","",gsub("\\(","",cctv_tab$Location))
tmp=strsplit(tmp,split=",")
cctv_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
cctv_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=arrest_tab,aes(x=lon,y=lat),color="blue",alpha=.1)
plt=plt+geom_point(data=cctv_tab,aes(x=lon,y=lat),color="red")
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).
Is there any relationship between the number of CCTV cameras and the number of arrests? Divide the city into a grid and plot the number of CCTV cameras vs. the number of arrests.
# step 1: divide city into grid for arrest data
# step 1a: find the range of latitude and longitude
latRange=range(arrest_tab$lat,na.rm=TRUE)
lonRange=range(arrest_tab$lon,na.rm=TRUE)
# step 1b: discretize latitude into 50 bins
latGrid=seq(min(latRange),max(latRange),len=50)
latFactor=cut(arrest_tab$lat,breaks=latGrid)
# now longitude
lonGrid=seq(min(lonRange),max(lonRange),len=50)
lonFactor=cut(arrest_tab$lon,breaks=lonGrid)
# step 1c: make a factor indicating geographic grid location
gridFactor=factor(paste(lonFactor,latFactor,sep=":"))
# step 2: do the same for the cctv data
latFactor=cut(cctv_tab$lat,breaks=latGrid)
lonFactor=cut(cctv_tab$lon,breaks=lonGrid)
cctvGridFactor=factor(paste(lonFactor,latFactor,sep=":"))
arrestTab=table(gridFactor)
cctvTab=table(cctvGridFactor)
m=match(names(cctvTab),names(arrestTab))
plot(arrestTab[m]~factor(cctvTab),xlab="Number of CCTV cameras", ylab="Number of Arrests")
As part of Project 1 you will add to this analysis. Please use the following template:
What question are you asking?:
I am trying to answer: at what time of day are most people arrested? For this preliminary analysis, I plan on making a plot where hour is the x-axis and the number of arrests is the y-axis. This will produce an “Arrest Timeseries.”
What is the code you use to answer it?:
time <- strptime(arrest_tab$arrestTime, "%H:%M")
arrest_tab$hours <- as.numeric(format(time, "%H"))
hours_df <- as.data.frame(table(arrest_tab$hours))
names(hours_df) <- c("hour","count")
g <- ggplot(hours_df, aes(hour, count, group=1))+geom_line(color="blue")+geom_point(color="blue")
g <- g+labs(title = "Arrest Timeseries", x="Time of Day",y="Num of Arrests")
g <- g+scale_x_discrete(breaks=seq(0,23,2))
g <- g + theme(plot.title=element_text(size=16,face="bold"),axis.title.x=element_text(size=16,face="bold"),axis.title.y=element_text(size=16,face="bold"))
g
What did you observe?
I had originally thought that there would be very few arrests until 8 pm, at which point there would be a giant spike from 8 pm to 5 am. But that was not the case. Instead, the two biggest arrest hours were 6 pm followed by 10 am (!!). At this point, I’m not entirely sure why that might be. I would be surprised, however, if all offenses followed this exact same pattern.
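One way to check that last hunch would be to facet the hourly counts by offense type. A rough sketch, assuming the arrest_tab$hours column computed above and ggplot2 loaded (the choice of the four most common offenses is arbitrary):

```r
# hedged sketch: hourly arrest pattern for the four most common offenses
top4 <- names(sort(table(arrest_tab$incidentOffense), decreasing=TRUE))[1:4]
sub <- arrest_tab[arrest_tab$incidentOffense %in% top4,]
by_hour <- as.data.frame(table(offense=sub$incidentOffense, hour=sub$hours))
by_hour$hour <- as.numeric(as.character(by_hour$hour))
ggplot(by_hour, aes(x=hour, y=Freq)) +
  geom_line(color="blue") +
  facet_wrap(~offense, scales="free_y") +
  labs(x="Time of Day", y="Num of Arrests")
```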
What question are you asking?: I am trying to find out how cameras affect the sorts of crimes in their area, whether by deterring certain types of crime or by leading to more arrests for other types.
What is the code you use to answer it?:
# modified code from above, to create factors, but remove NA
# added by HCB to restore original arrest table
arrest_tab_original = arrest_tab
#
arrest_tab = arrest_tab[!is.na(arrest_tab$lat) & !is.na(arrest_tab$lon),]
latRange=range(arrest_tab$lat,na.rm=TRUE)
lonRange=range(arrest_tab$lon,na.rm=TRUE)
latGrid=seq(min(latRange),max(latRange),len=50)
latFactor=cut(arrest_tab$lat,breaks=latGrid)
lonGrid=seq(min(lonRange),max(lonRange),len=50)
lonFactor=cut(arrest_tab$lon,breaks=lonGrid)
gridFactor=factor(paste(lonFactor,latFactor,sep=":"))
latFactor=cut(cctv_tab$lat,breaks=latGrid)
lonFactor=cut(cctv_tab$lon,breaks=lonGrid)
cctvGridFactor=factor(paste(lonFactor,latFactor,sep=":"))
arrestTab=table(gridFactor)
cctvTab=table(cctvGridFactor)
#count crimes in areas with and without camera
arrestOnCamera = gridFactor %in% names(cctvTab)
count_crime_tab <- table(arrest_tab$incidentOffense, arrestOnCamera)
#merge the two tables, and calculate the difference in crime frequency in the two situations
crime_tab <- data.frame(count_crime_tab[,1], count_crime_tab[,2])
colnames(crime_tab)[1] <- "noCamCrimes"
colnames(crime_tab)[2] <- "camCrimes"
crime_tab$names <- rownames(crime_tab)
crime_tab$campct <- crime_tab$camCrimes/sum(crime_tab$camCrimes)*100
crime_tab$nocampct <- crime_tab$noCamCrimes/sum(crime_tab$noCamCrimes)*100
crime_tab$pctchange <- crime_tab$campct - crime_tab$nocampct
#display the change in crime frequency with crime name in descending order, with the most increased (caught) crimes first
crime_tab <- crime_tab[with(crime_tab, order(-pctchange)), ]
options(scipen=999)
subset(crime_tab, select=c("pctchange"))
## pctchange
## 87-Narcotics 8.7959299656
## 87O-Narcotics (Outside) 2.1085712279
## 55-Disorderly Person 0.4899152469
## 109-Loitering 0.1601595042
## 79-Other 0.1421819771
## 54-Armed Person 0.0990559521
## 2A-Rape (Force) 0.0638557474
## 6B-Larceny- Purse Snatch 0.0539250917
## 108-Liquor Law/Open Container 0.0514977916
## 117-Fto 0.0498783769
## 1A-Murder 0.0484986356
## 2B-Rape (Attempt) 0.0480375073
## 3D-Robb Comm. (Ua) 0.0476871064
## 61-Person Wanted On War 0.0430503173
## 7C-Stolen Veh./Other 0.0404124933
## 2F-Placing Hands 0.0398369938
## 2D-Statutory Rape 0.0392614943
## 3AO-Robb Hwy-Other Wpn 0.0381032079
## 88-Unfounded Call 0.0316364803
## 78-Gambling 0.0310500497
## 116-Public Urination / Defecation 0.0301387241
## 49-Family Disturbance 0.0279328789
## 5C-Burg. Res. (Noforce) 0.0253984949
## 58-Injured Person 0.0226317250
## 3CK-Robb Comm-Knife 0.0165117545
## 60-Sick Person 0.0163973833
## 8AO-Arson Sin Res Str-Occ 0.0148959835
## 80-Lost Property 0.0130478265
## 20H-Traffic Control 0.0122399410
## 2H-Indecent Exp. 0.0114284118
## 48-Involuntary Detention 0.0094658836
## 77-Dog Bite 0.0087760130
## 73-False Pretense 0.0083112410
## 106-Custody Dispute 0.0072746132
## 113-Littering 0.0069278560
## 8J-Arson Other 0.0061199705
## 76-Child Abuse-Sexual 0.0046185707
## 2C-Carnal Knowledge 0.0038106852
## 3M-Robb Bank (Ua) 0.0038106852
## 64-Drug Free Zone 0.0038106852
## 6K-Larceny- Park. Meter 0.0038106852
## 6A-Larceny- Pickpocket 0.0034639280
## 93-Abduction - Other 0.0034639280
## 3B-Robb Highway (Ua) 0.0024054381
## 20J- 0.0011546427
## 84-Bomb Scare 0.0011546427
## 8BO-Arson Oth Res Str-Occ 0.0011546427
## 104-Malicious Burning 0.0008078855
## 2G-Sodomy/Perverson -0.0013870287
## 59-Intoxicated Person -0.0015013998
## 71-Sex Offender Registry -0.0015013998
## 3P-Robb Misc. (Ua) -0.0017337858
## 3AJK-Robb Carjack-Knife -0.0018481570
## 13-Assist Officer -0.0026560425
## 3F-Robb Gas Sta. (Ua) -0.0026560425
## 6L-Larceny- From Locker -0.0026560425
## 8I-Arson Other Mobile -0.0026560425
## 6H-Larceny- From Machine -0.0030027997
## 3GK-Robb Conv Store-Knife -0.0041574423
## 8H-Arson Motor Veh -0.0041574423
## 6F-Larceny- Bicycle -0.0043898283
## 3GF-Robb Conv Store-Firearm -0.0045041995
## 112-Traffic Related Incident -0.0048546003
## 85-Mental Case -0.0051977138
## 102-Questional Death -0.0053120850
## 110-Summons Served -0.0053120850
## 28-Suicide - Attempt -0.0053120850
## 3NF-Robb Misc-Firearm -0.0063523565
## 3NO-Robb Misc-Other Wpn -0.0068134848
## 3JF-Robb Residence-Firearm -0.0070458708
## 56-Missing Person -0.0073926280
## 3GO-Robb Conv Store-Other Wpn -0.0079681275
## 3LF-Robb Bank-Firearm -0.0079681275
## 3LO-Robb Bank-Other Wpn -0.0079681275
## 70A-Ill. Dumping -0.0083148847
## 81-Recovered Property -0.0097055570
## 67-Child Abuse-Physical -0.0098162845
## 3H-Robb Conv. Stor.(Ua) -0.0121255698
## 52A-Animal Cruelty -0.0121255698
## 103-Dead On Arrival -0.0124723270
## 2J-Other Sex Offn. -0.0132802125
## 3CO-Robb Comm-Other Wpn -0.0136269697
## 3JO-Robb Residence-Other Wpn -0.0136269697
## 114-Hindering -0.0152463843
## 3BJ-Robb Carjack(Ua) -0.0158218838
## 39-Fire -0.0174376548
## 83-Discharging Firearm -0.0174376548
## 3CF-Robb Comm-Firearm -0.0191714407
## 3NK-Robb Misc-Knife -0.0215950971
## 96-Stop & Frisk -0.0215950971
## 3AJF-Robb Carjack-Firearm -0.0236756401
## 29-Driving While Intox. -0.0245978968
## 5E-Burg. Oth. (Att.) -0.0257525395
## 3AK-Robb Hwy-Knife -0.0314150253
## 6E-Larceny- Auto Acc -0.0348789533
## 98-Child Neglect -0.0400730235
## 5F-Burg. Oth. (Noforce) -0.0405341518
## 3JK-Robb Residence-Knife -0.0454994796
## 3AF-Robb Hwy-Firearm -0.0496678530
## 23-Unauthorized Use -0.0497749368
## 95-Exparte -0.0556625213
## 4F-Assault By Threat -0.0558985510
## 26-Recovered Vehicle -0.0613322945
## 6D-Larceny- From Auto -0.0768110648
## 5B-Burg. Res. (Att.) -0.0876530461
## 20A-Followup -0.0903127323
## 111-Protective Order -0.0968938311
## 6J-Larceny- Other -0.1005974324
## 3K-Robb Res. (Ua) -0.1019735301
## 5D-Burg. Oth. (Force) -0.1114467011
## 6G-Larceny- From Bldg. -0.1242657852
## 7A-Stolen Auto -0.1317800717
## 115-Trespassing -0.1403564919
## 4A-Agg. Asslt.- Gun -0.2195365579
## 75-Destruct. Of Property -0.2844580826
## 4D-Agg. Asslt.- Hand -0.3185182194
## 6C-Larceny- Shoplifting -0.4077418953
## 4B-Agg. Asslt.- Cut -0.4938461736
## 5A-Burg. Res. (Force) -0.5879404416
## 4C-Agg. Asslt.- Oth. -0.6379805575
## 24-Towed Vehicle -1.0684576178
## 97-Search & Seizure -1.1822536848
## 55A-Prostitution -1.1994275774
## Unknown Offense -1.5811234143
## 4E-Common Assault -2.7616680093
# added by HCB to restore original arrest table
arrest_tab = arrest_tab_original
What did you observe? The results were interesting. We see a large increase in charges of narcotics, which may be due to camera surveillance. We also see a decrease in assault, which may be due to perpetrators realizing the danger of committing such crimes in front of a camera. However, the vast majority of crimes do not even see a 1% change between the two situations, so it would appear that, overall, cameras do not have a major effect on criminal activity.
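A natural follow-up is to ask whether a difference like the narcotics one is larger than sampling noise. A rough sketch using a two-proportion test, assuming the crime_tab data frame built above is still in scope:

```r
# hedged sketch: is the narcotics share really higher in camera cells?
narc <- crime_tab[crime_tab$names == "87-Narcotics",]
prop.test(x = c(narc$camCrimes, narc$noCamCrimes),
          n = c(sum(crime_tab$camCrimes), sum(crime_tab$noCamCrimes)))
```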
What question are you asking?: Which neighborhoods in Baltimore have the highest number of arrests?
What is the code you use to answer it?:
Load libraries
library(rgdal) # needed for reading shape files
## Loading required package: sp
## rgdal: version: 0.9-1, (SVN revision 518)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.10.1, released 2013/08/26
## Path to GDAL shared files: /usr/share/gdal/1.10
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)
library(plyr) # needed for rename function
library(sp) # needed for point.in.polygon function
library(ggmap) # could use for geocoding addresses
library(ggplot2) # needed for plotting
Find number of arrests for which the geo coordinates weren’t given
no.geo.idx = nchar(arrest_tab$Location.1) == 0
n.geo.missing = sum( no.geo.idx )
narrests = dim(arrest_tab)[1]
n.geo.missing/narrests*100 # 39%
## [1] 38.87571
Find the number of incidents that don’t have geocode info but do have an incidentLocation
has.location = nchar(arrest_tab$incidentLocation) > 0
sum(no.geo.idx & has.location)
## [1] 7650
#tmp = paste(arrest_tab$incidentLocation[no.geo.idx & has.location], "Baltimore, MD")
#gc = geocode(tmp[1:2490]) # restricted to 2500 api requests per day
Get the 2010 statistical community area boundaries from Baltimore City’s open data site. Download the shape files and extract them from the .zip file.
setwd("csa_2010_boundaries/")
csa = readOGR(dsn=".",layer="CSA_NSA_Tracts")
## OGR data source with driver: ESRI Shapefile
## Source: ".", layer: "CSA_NSA_Tracts"
## with 56 features and 3 fields
## Feature type: wkbPolygon with 2 dimensions
csa.df = fortify(csa) # fortify turns the shape data into a data.frame
## Regions defined for each Polygons
csa.df = rename(csa.df, c("long"="X.feet","lat"="Y.feet")) # MD uses State Plane coords instead of lat/lon (see comments section)
convert = FALSE
if (convert){ # write a file to send to matlab code (described in comments section)
write.csv(csa.df[,c("lat","lon")], "csa-df.txt", quote=FALSE, na="",row.names=FALSE)
}
csa.converted.df = read.csv("csa-df_converted.txt",header=FALSE) # output of the matlab code, converted to lat/lon
setwd("..")
csa.converted.df = rename(csa.converted.df, c("V1"="lat","V2"="lon"))
csa.df = cbind(csa.df, csa.converted.df)
Now assign each of the arrest records to a neighborhood; this is only possible for the records that have geo info. This step takes about 15 to 20 seconds.
ncsa = dim(csa)[1]
arrest_tab_geo = arrest_tab[!no.geo.idx,]
narrests.geo = dim(arrest_tab_geo)[1]
arrest_nbhd_id = vector(length = narrests.geo)
for (j in 1:ncsa) { # takes about 30 sec
idx = csa.df$id == j-1
polyx = csa.df$lon[idx]
polyy = csa.df$lat[idx]
in.poly = point.in.polygon(arrest_tab_geo$lon, arrest_tab_geo$lat, polyx,polyy)
in.poly= as.logical(in.poly)
arrest_nbhd_id[in.poly] = j - 1
}
arrest_tab_geo = cbind(arrest_tab_geo, arrest_nbhd_id)
For each neighborhood, count the number of arrests, using the table function
nbhd.narrests = as.data.frame(table(arrest_nbhd_id))
nbhd.narrests = rename(nbhd.narrests, c("arrest_nbhd_id"="id", "Freq"="narrests"))
nbhd.names= as.vector(csa$Neigh)
nbhd.narrests = cbind(nbhd.names, nbhd.narrests)
head(nbhd.narrests)
## nbhd.names
## 1 Allendale, Carroll-South Hilton, Gwynns Falls, Irvington, Saint Josephs, Uplands, Yale Heights, Lower Edmondson Village
## 2 Beechfield, Hunting Ridge, Ten Hills, Tremont, West Hills, Westgate
## 3 Belair-Edison, Clifton Park, Four By Four, Mayfield, Herring Run Park
## 4 Brooklyn, Curtis Bay, Fairfield Area, Hawkins Point, Curtis Bay Industrial Area
## 5 Canton, Patterson Park
## 6 Cedonia, Frankford, Cedmont, Parkside
## id narrests
## 1 0 1358
## 2 1 346
## 3 2 1752
## 4 3 2155
## 5 4 165
## 6 5 1330
Merge the arrest counts with the geometry data
csa.df = merge(csa.df, nbhd.narrests, by="id",all.x=TRUE)
Make a plot colored by number of arrests
g = ggplot(csa.df,aes(x=lon,y=lat,group=group))
g = g + geom_polygon(aes(fill=narrests)) + scale_fill_gradient(low="slategray1",high="slateblue4") # color the nbhds by narrests
g = g + geom_path(colour="gray75",size=.1) # draw lines separating the neighborhoods
g = g + ggtitle("Baltimore City Arrests 2011 - 2012") # add a title
g = g + theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(),
axis.title.x = element_blank(), axis.title.y = element_blank()) # remove axis labels
print(g)
What did you observe? First, it should be noted that out of the 104,528 arrest records, 40,636 of them (about 39%) did not have geocoded locations (latitude and longitude). Some of them (7,650) did have an address in the incidentLocation field, so it would be possible to geocode these, though it would not contribute enormously. (I did take a look at the ggmap library, which can convert an address string into lat and lon by calling the Google Maps geocoding API; however, it only allows 2,500 requests per day.) Therefore my analysis only includes the 61% of records that provided geocoded information.
Second, I need to note that it was a pain converting from the MD State Plane coordinates into longitude and latitude. I ended up using an external MATLAB function to do the conversion, since it seemed really confusing to do in R.
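For the record, the reprojection can likely be done in R as well via sp’s spTransform. The following is an untested sketch that assumes the shapefile is in the Maryland State Plane NAD83 (feet) system, EPSG:2248 (check the .prj file before trusting it):

```r
# hedged sketch: reproject MD State Plane (feet) to lon/lat in R
library(sp)
library(rgdal)
pts <- SpatialPoints(csa.df[, c("X.feet","Y.feet")],
                     proj4string=CRS("+init=epsg:2248"))  # assumed CRS
pts.ll <- spTransform(pts, CRS("+proj=longlat +datum=WGS84"))
csa.df$lon <- coordinates(pts.ll)[,1]
csa.df$lat <- coordinates(pts.ll)[,2]
```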
The results: As could probably be expected, the highest number of arrests occurred in the downtown area, with the northwest area being notably high as well. The inner harbor neighborhood is among the lowest, which makes sense as this area is more touristy. The central northern neighborhoods are also on the low end (I don’t know Baltimore, but I’m guessing these are higher-income neighborhoods).
For future analysis, it would be good to create a similar plot where the colors represent neighborhood income level. I would also like to add a layer showing the locations of transit stations, since these are commonly believed to attract crime.
What question are you asking?: Which is the most common crime per race?
What is the code you use to answer it?:
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="A"]),decreasing=TRUE))
##
## Unknown Offense 87-Narcotics 4E-Common Assault
## 98 42 22
## 87O-Narcotics (Outside) 24-Towed Vehicle 61-Person Wanted On War
## 13 10 6
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="B"]),decreasing=TRUE))
##
## Unknown Offense 87-Narcotics 87O-Narcotics (Outside)
## 31500 21783 5706
## 4E-Common Assault 97-Search & Seizure 79-Other
## 5500 3255 2879
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="H"]),decreasing=TRUE))
## 87-Narcotics
## 1
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="I"]),decreasing=TRUE))
##
## Unknown Offense 87-Narcotics 4E-Common Assault
## 92 44 14
## 97-Search & Seizure 24-Towed Vehicle 4C-Agg. Asslt.- Oth.
## 8 6 5
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="U"]),decreasing=TRUE))
##
## Unknown Offense 4E-Common Assault 87-Narcotics
## 893 179 171
## 87O-Narcotics (Outside) 4B-Agg. Asslt.- Cut 79-Other
## 51 48 46
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="W"]),decreasing=TRUE))
##
## Unknown Offense 87-Narcotics 4E-Common Assault
## 6065 2702 1024
## 87O-Narcotics (Outside) 55A-Prostitution 79-Other
## 740 671 528
What did you observe? I observed that, aside from Unknown Offense, the most common crime for every race category but one was Narcotics; the “U” category had Common Assault as its most common crime.
What question are you asking?: Which year had the largest number of arrests?
What is the code you use to answer it?:
a <- data.frame(id = arrest_tab$arrest, year = substr(arrest_tab$arrestDate,7,10))
head(sort(table(a$year), decreasing=TRUE),10)
##
## 2011 2012
## 52868 51660
What did you observe? 2011 had the largest number of arrests but there was no dramatic difference between the number of arrests for each year. There were 52,868 arrests in 2011 which was 1208 more than the number of arrests for 2012.
What are you asking?:
I wanted to know whether the ratio of female to male crime rates is similar across the different races.
What is the code you used to answer it?
mytab = table(race=arrest_tab$race, sex=arrest_tab$sex)
#mydf$V1 = NULL
mydf = as.data.frame.matrix(mytab)
mydf$ratio = mydf$F/mydf$M
mydf$ratio <- round(mydf$ratio, 3)
mydf = mydf[-1,]
mydf
## V1 F M ratio
## A 0 37 205 0.180
## B 0 14663 72605 0.202
## H 0 0 1 0.000
## I 0 34 184 0.185
## U 0 183 1566 0.117
## W 0 4514 10534 0.429
mydf$race <- c("A","B","H","I","U","W")
ggplot(data=mydf, aes(x=race, y=ratio, fill=race)) + geom_bar(stat="identity")
What are your observations? I found that for most races, the ratio of female to male crime rates is about 0.2. This tells us that there is about 1 female for every 5 male criminal offenses. I also saw that the female-to-male crime ratio for the white population is about 0.43, which is much higher than the others.
What question are you asking?: Do police officers go out of their way to arrest more black people than white people?
What is the code you use to answer it?:
library(ggplot2)
# added by HCB to not dirty global environment
kpai=function()
{
arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
tmp=arrest_tab$sex
arrest_tab$sex=arrest_tab$race
arrest_tab$race=tmp
police_tab=read.csv("Police_Stations.csv", stringsAsFactors=FALSE)
tmp=gsub("\\)","",gsub("\\(","",arrest_tab$Location.1))
tmp=strsplit(tmp,split=",")
arrest_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
arrest_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
tmp=gsub("\\)","",gsub("\\(","",police_tab$Location))
tmp=strsplit(tmp,split=",")
police_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
police_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
plt=ggplot()
plt=plt+geom_point(data=arrest_tab[arrest_tab$race=='B',],aes(x=lon,y=lat),color="black",alpha=.1)
plt=plt+geom_point(data=arrest_tab[arrest_tab$race=='W',],aes(x=lon,y=lat),color="white",alpha=.1)
plt=plt+geom_point(data=police_tab,aes(x=lon,y=lat),color="red")
print(plt)
}
kpai()
## Warning: Removed 33454 rows containing missing values (geom_point).
## Warning: Removed 6251 rows containing missing values (geom_point).
What did you observe? I was surprised to find that most of the arrests made some distance away from the cluster of police stations were for white people. It would be interesting to investigate what kinds of crimes might have been committed so far away from the stations, and why white people stand out as being arrested at that distance, especially in the east.
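One way to pursue that would be to compute each arrest’s distance to the nearest station and compare across races. A rough sketch, assuming police_tab has been built at the top level the same way kpai() builds it, and using plain Euclidean distance in degrees as a crude proxy:

```r
# hedged sketch: distance from each arrest to the nearest police station
# assumes arrest_tab$lon/lat exist (as computed earlier) and police_tab is
# built at top level the same way kpai() builds it
nearest_station <- function(lon, lat, stations) {
  min(sqrt((stations$lon - lon)^2 + (stations$lat - lat)^2))
}
ok <- !is.na(arrest_tab$lon) & !is.na(arrest_tab$lat)
d <- mapply(nearest_station, arrest_tab$lon[ok], arrest_tab$lat[ok],
            MoreArgs = list(stations = police_tab))
# crude comparison of typical distance (in degrees) by race
tapply(d, arrest_tab$race[ok], median)
```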
Is there a correlation between arrest location and the locations of vacant buildings?
# Read in the larger arrests dataset
recentArrests <- read.csv("full_BPD_Arrests.csv", header=TRUE)
# Convert the ArrestDate column
recentArrests$ArrestDate <- as.Date(recentArrests$ArrestDate, "%m/%d/%Y")
# Remove old arrests and arrests without reported locations
recentArrests <- subset(recentArrests, recentArrests$ArrestDate>="2015-01-01")
recentArrests <- subset(recentArrests, recentArrests$Location.1 != "")
# Read in the vacant buildings dataset
vacantBldgs = read.csv("Vacant_Buildings.csv", stringsAsFactors=FALSE,
header=TRUE)
# Remove rows without a location
vacantBldgs <- subset(vacantBldgs, vacantBldgs$Location.1 != "")
# Extract lat and lon into numeric columns
tmp = vacantBldgs$Location.1
tmp=gsub("\\)","",gsub("\\(","",vacantBldgs$Location.1))
tmp=strsplit(tmp, split=", ")
vacantBldgs$lon=as.numeric(sapply(tmp,function(x) x[2]))
vacantBldgs$lat=as.numeric(sapply(tmp,function(x) x[1]))
tmp = recentArrests$Location.1
tmp=gsub("\\)","",gsub("\\(","",recentArrests$Location.1))
tmp=strsplit(tmp, split=", ")
recentArrests$lon=as.numeric(sapply(tmp,function(x) x[2]))
recentArrests$lat=as.numeric(sapply(tmp,function(x) x[1]))
balto_map = subset(map_data("county", region="maryland"),subregion=="baltimore city")
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=vacantBldgs,aes(x=lon,y=lat),color="blue",alpha=.1)
plt=plt+geom_point(data=recentArrests,aes(x=lon,y=lat),color="red",alpha=.1)
print(plt)
Based on the plot we produced, we observe that arrests are more concentrated around areas with vacant buildings.
Note: because there were no dates associated with when buildings became vacant (the data is updated twice a month), we used arrest records from the month of January 2015 to minimize the number of “extra” vacant buildings. We assume that there was not a significant number of new vacant buildings added since the month of January.
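The visual impression could also be checked numerically by counting arrests and vacant buildings in the same grid cells and correlating the counts. A rough sketch, reusing the 50-bin grid idea from the CCTV analysis above (note this only compares cells containing at least one of each, so it ignores empty cells):

```r
# hedged sketch: per-cell correlation between arrests and vacant buildings
latGrid <- seq(min(recentArrests$lat), max(recentArrests$lat), len=50)
lonGrid <- seq(min(recentArrests$lon), max(recentArrests$lon), len=50)
aCell <- paste(cut(recentArrests$lon, lonGrid), cut(recentArrests$lat, latGrid))
vCell <- paste(cut(vacantBldgs$lon, lonGrid), cut(vacantBldgs$lat, latGrid))
aTab <- table(aCell)
vTab <- table(vCell)
common <- intersect(names(aTab), names(vTab))
cor(as.numeric(aTab[common]), as.numeric(vTab[common]))
```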
What question are you asking?:
What is the female to male ratio of different types of crimes on Valentine’s Day, 2011?
library(plyr)
vdaytab <- table(incident=arrest_tab$incidentOffense,sex=arrest_tab$sex,date=arrest_tab$arrestDate=="02/14/2011")
vdaytab <- vdaytab[,,-1]
vdaydf = as.data.frame.matrix(vdaytab)
vdaydf$ratio = vdaydf$F/vdaydf$M
vdaydf$ratio <- round(vdaydf$ratio, 3)
vdaydf <- subset(vdaydf, M > 0 | F > 0)
vdaydf[-1]
## F M ratio
## 111-Protective Order 0 1 0.000
## 112-Traffic Related Incident 1 0 Inf
## 115-Trespassing 0 1 0.000
## 23-Unauthorized Use 2 0 Inf
## 24-Towed Vehicle 0 2 0.000
## 26-Recovered Vehicle 1 0 Inf
## 3K-Robb Res. (Ua) 0 2 0.000
## 3P-Robb Misc. (Ua) 0 1 0.000
## 4B-Agg. Asslt.- Cut 1 1 1.000
## 4C-Agg. Asslt.- Oth. 0 1 0.000
## 4E-Common Assault 1 4 0.250
## 54-Armed Person 0 1 0.000
## 5D-Burg. Oth. (Force) 0 1 0.000
## 6C-Larceny- Shoplifting 0 2 0.000
## 79-Other 1 8 0.125
## 7A-Stolen Auto 1 0 Inf
## 87-Narcotics 0 13 0.000
## 87O-Narcotics (Outside) 1 4 0.250
## 97-Search & Seizure 1 1 1.000
## Unknown Offense 11 52 0.212
vdaydf$offense <- c("111", "112", "115", "23", "24", "26", "3K", "3P", "4B", "4C", "4E", "54", "5D", "6C", "79", "7A", "87", "87O", "97", "Unknown")
vdaydf$offense_description <- c("(111) Protective Order", "(112) Traffic Related Incident", "(115) Trespassing", "(23) Unauthorized Use", "(24) Towed Vehicle", "(26) Recovered Vehicle", "(3K) Robb Res. (Ua)", "(3P) Robb Misc. (Ua)", "(4B) Agg. Asslt.- Cut", "(4C) Agg. Asslt.- Oth.", "(4E) Common Assault", "(54) Armed Person", "(5D) Burg. Oth. (Force)", "(6C) Larceny- Shoplifting", "(79) Other", "(7A) Stolen Auto", "(87) Narcotics", "(87O)Narcotics (Outside)", "(97) Search & Seizure", "Unknown Offense")
ggplot(data=vdaydf, aes(x=offense, y=ratio, fill=offense_description)) + geom_bar(stat="identity") + ggtitle("Valentine's Day 2011 F/M Crime Ratio") + theme(axis.text.x = element_text(angle=90))
# Removing 0 and Inf ratios
v <- subset(vdaydf, ratio > 0 & ratio < Inf)
v[-1]
## F M ratio offense offense_description
## 4B-Agg. Asslt.- Cut 1 1 1.000 4B (4B) Agg. Asslt.- Cut
## 4E-Common Assault 1 4 0.250 4E (4E) Common Assault
## 79-Other 1 8 0.125 79 (79) Other
## 87O-Narcotics (Outside) 1 4 0.250 87O (87O)Narcotics (Outside)
## 97-Search & Seizure 1 1 1.000 97 (97) Search & Seizure
## Unknown Offense 11 52 0.212 Unknown Unknown Offense
v$offense <- c("4B", "4E", "79", "87O", "97", "Unknown")
v$offense_description <- c("Agg. Asslt.- Cut", "Common Assault", "Other", "Narcotics (Outside)", "Search & Seizure", "Unknown Offense")
ggplot(data=v, aes(x=offense, y=ratio, fill=offense_description)) + geom_bar(stat="identity") + ggtitle("Valentine's Day 2011 F/M Crime Ratio")
What did you observe? We were surprised to find that there were significantly more Male arrests than Female arrests. In the categories where Males and Females were arrested for the same crime, only about 1/4 of them were Female. We also thought it was interesting that there was only 1 Female Narcotics related arrest. Whether or not 17 Narcotics related Male arrests were related to Valentine’s Day remains to be seen.
What are you asking?
What types of crimes have increased and decreased in frequency?
What is the code you use to answer it?
library(ggplot2)
library(stringr)
require(data.table)
## Loading required package: data.table
# Filter the list so that it only includes types of offenses that have happened over 500 times
# (So that it has enough data to be meaningful)
arrest_type <- table(arrest_tab$incidentOffense)
arrest_type[arrest_type >= 500]
##
## 115-Trespassing 24-Towed Vehicle 4B-Agg. Asslt.- Cut
## 871 2994 1195
## 4C-Agg. Asslt.- Oth. 4D-Agg. Asslt.- Hand 4E-Common Assault
## 1556 618 6739
## 55-Disorderly Person 55A-Prostitution 5A-Burg. Res. (Force)
## 923 1398 847
## 6C-Larceny- Shoplifting 75-Destruct. Of Property 79-Other
## 1849 686 3461
## 87-Narcotics 87O-Narcotics (Outside) 97-Search & Seizure
## 24744 6515 3670
## Unknown Offense
## 38649
arrest_type_list <- names(arrest_type[arrest_type >= 500])
arrest_type_list
## [1] "115-Trespassing" "24-Towed Vehicle"
## [3] "4B-Agg. Asslt.- Cut" "4C-Agg. Asslt.- Oth."
## [5] "4D-Agg. Asslt.- Hand" "4E-Common Assault"
## [7] "55-Disorderly Person" "55A-Prostitution"
## [9] "5A-Burg. Res. (Force)" "6C-Larceny- Shoplifting"
## [11] "75-Destruct. Of Property" "79-Other"
## [13] "87-Narcotics" "87O-Narcotics (Outside)"
## [15] "97-Search & Seizure" "Unknown Offense"
# All the arrests for those types that we singled out
arrest_dates <- arrest_tab[arrest_tab$incidentOffense %in% arrest_type_list, ]
# Set all the counts to one so we can sum them when we aggregate
arrest_table <- data.table(arrest_dates)
arrest_table <- arrest_table[, count := 1]  # no by= needed when assigning a constant
# Aggregate all the data together by how many times a type of incident occured on each day
arrest_table <- setNames(aggregate(arrest_table$count, list(arrest_table$incidentOffense, arrest_table$arrestDate), sum), c("incidentOffense","arrestDate","count"))
# Graph it
ggplot(data=arrest_table, aes(x=arrestDate, y=count, group=incidentOffense, colour=incidentOffense)) +
geom_line() +
geom_point()
# The last graph has too many data points crammed in, so we aggregate the data by month and graph it again
# Set a shortdate column with a simple month and year so the system can aggregate them on it
arrest_table$shortdate <- strftime(as.Date(arrest_table$arrestDate, "%m/%d/%Y"), format="%y/%m")
arrest_table <- setNames(aggregate(arrest_table$count, list(arrest_table$incidentOffense, arrest_table$shortdate), sum), c("incidentOffense","arrestMonth","count"))
ggplot(data=arrest_table, aes(x=arrestMonth, y=count, group=incidentOffense, colour=incidentOffense)) +
geom_line() +
geom_point()
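As an aside, the per-month counts built above with data.table and aggregate can also be obtained with a single table() call. A minimal, self-contained sketch on a toy data frame whose columns mirror arrest_tab (the dates and offenses below are made up):

```r
# Toy stand-in for arrest_tab (hypothetical rows)
toy <- data.frame(
  arrestDate      = c("01/05/2011", "01/20/2011", "02/03/2011", "02/03/2011"),
  incidentOffense = c("87-Narcotics", "87-Narcotics", "79-Other", "87-Narcotics"),
  stringsAsFactors = FALSE
)
toy$shortdate <- strftime(as.Date(toy$arrestDate, "%m/%d/%Y"), format = "%y/%m")
# One row per (offense, month), with the counts in the Freq column
counts <- as.data.frame(table(incidentOffense = toy$incidentOffense,
                              arrestMonth     = toy$shortdate))
counts
```

table() cross-tabulates and as.data.frame() flattens the result, which avoids the count-column-plus-aggregate dance entirely.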
# Let's take a look at the graph without the two highest-count offense types, since they dwarf the others
low_arrests <- arrest_table[arrest_table$incidentOffense!="87-Narcotics",]
low_arrests <- low_arrests[low_arrests$incidentOffense!="Unknown Offense",]
ggplot(data=low_arrests, aes(x=arrestMonth, y=count, group=incidentOffense, colour=incidentOffense)) +
geom_line() +
geom_point()
# Take a closer look at the most common type of offense that's not "Unknown"
# (87-Narcotics)
high_arrests <- arrest_table[arrest_table$incidentOffense=="87-Narcotics",]
high_arrests
## incidentOffense arrestMonth count
## 13 87-Narcotics 11/01 740
## 29 87-Narcotics 11/02 830
## 45 87-Narcotics 11/03 824
## 61 87-Narcotics 11/04 887
## 77 87-Narcotics 11/05 1140
## 93 87-Narcotics 11/06 1080
## 109 87-Narcotics 11/07 1034
## 125 87-Narcotics 11/08 991
## 141 87-Narcotics 11/09 1044
## 157 87-Narcotics 11/10 1052
## 173 87-Narcotics 11/11 822
## 189 87-Narcotics 11/12 983
## 205 87-Narcotics 12/01 1030
## 221 87-Narcotics 12/02 982
## 237 87-Narcotics 12/03 1245
## 253 87-Narcotics 12/04 1042
## 269 87-Narcotics 12/05 1165
## 285 87-Narcotics 12/06 1088
## 301 87-Narcotics 12/07 1117
## 317 87-Narcotics 12/08 1444
## 333 87-Narcotics 12/09 1245
## 349 87-Narcotics 12/10 1127
## 365 87-Narcotics 12/11 940
## 381 87-Narcotics 12/12 892
ggplot(data=high_arrests, aes(x=arrestMonth, y=count, group=incidentOffense, colour=incidentOffense)) +
geom_line() +
geom_point()
# Now let's split this into two lines, one for each year
high_arrests$year <- str_split_fixed(high_arrests$arrestMonth, "/", 2)[,1]
high_arrests$month <- str_split_fixed(high_arrests$arrestMonth, "/", 2)[,2]
ggplot(data=high_arrests, aes(x=month, y=count, group=year, colour=year)) +
geom_line() +
geom_point()
What did you observe?
Between the beginning of 2011 and the end of 2012, arrest counts fluctuated quite a bit but ended up at roughly the same level. More specifically, the counts for the most common offense types all rose over the period and then dropped back to about where they started.
After looking at the graphs for each year separately, it appears that crime rises toward the middle of the year and dies down toward winter. This could be due to the weather, or perhaps to the holiday scheduling of the police (or of criminal activity).
However, the arrest rate does appear to have peaked in the middle of the period: in the year-separated graph for narcotics, the end of 2011 was clearly higher than the start of 2011, and the start of 2012 was higher than the end of 2012. This could possibly be attributed to lower employment levels during that time.
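The seasonal reading can be checked numerically from the monthly 87-Narcotics counts printed above, by averaging the two years and comparing summer to winter months:

```r
# Monthly 87-Narcotics counts from the table above (Jan through Dec)
y2011 <- c(740, 830, 824, 887, 1140, 1080, 1034, 991, 1044, 1052, 822, 983)
y2012 <- c(1030, 982, 1245, 1042, 1165, 1088, 1117, 1444, 1245, 1127, 940, 892)
monthly_mean <- setNames((y2011 + y2012) / 2, month.abb)
# Summer (Jun-Aug) averages higher than winter (Dec-Feb)
mean(monthly_mean[6:8]) > mean(monthly_mean[c(12, 1, 2)])
```

This is a coarse check on two years of data, not a seasonality model, but it agrees with the mid-year rise visible in the plots.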
Do people commit fewer crimes during the holidays?
(function() {
source("calendarHeat_modified.R")
arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
arrest_counts_tab <- table(arrest_tab$arrestDate)
arrest_counts_df <- data.frame(date=as.Date(names(arrest_counts_tab), "%m/%d/%Y"), num_arrests=as.vector(arrest_counts_tab))
calendarHeat(arrest_counts_df$date, arrest_counts_df$num_arrests, color="g2r", varname="Number of Arrests Per Day")
})()
## Loading required package: lattice
## Loading required package: grid
## Loading required package: chron
Looking at the heat map for 2011 and 2012, there is a very distinct reduction in the number of arrests on Thanksgiving (the fourth Thursday of November). Moreover, on Christmas (the last Sunday of December in 2011 and the last Tuesday of December in 2012) and the days leading up to it, we observe green squares as well. It's hard to tell whether this reduction in arrests is due to fewer officers working on the holidays or to fewer crimes being committed; data on the number of officers working each day would help settle the question.
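A rough numeric complement to eyeballing the heat map would be to compare holiday counts to the surrounding days. A minimal sketch on toy daily counts (the numbers below are hypothetical; the real input would be arrest_counts_df from the chunk above):

```r
# Hypothetical daily arrest counts around Christmas (not the real data)
df <- data.frame(
  date        = seq(as.Date("2011-12-20"), as.Date("2011-12-31"), by = "day"),
  num_arrests = c(150, 148, 152, 140, 120, 60, 130, 145, 150, 149, 151, 147)
)
is_holiday <- df$date == as.Date("2011-12-25")
# Is the holiday count below the mean of the surrounding days?
mean(df$num_arrests[is_holiday]) < mean(df$num_arrests[!is_holiday])
```

The same comparison over a vector of all holiday dates would quantify what the green squares suggest visually.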
What question are you asking?: How do the number of arrests change from 4th of July to Christmas for 2011 and 2012?
What is the code you use to answer it?:
a <- table(arrest_tab$arrest[arrest_tab$arrestDate=="07/04/2011"])
b <- table(arrest_tab$arrest[arrest_tab$arrestDate=="12/25/2011"])
c <- table(arrest_tab$arrest[arrest_tab$arrestDate=="07/04/2012"])
d <- table(arrest_tab$arrest[arrest_tab$arrestDate=="12/25/2012"])
num_of_arrests <- c(length(a),length(b),length(c),length(d))
barplot(num_of_arrests, main="Number of Arrests for 4th of July and Christmas", names.arg=c("07/04/2011","12/25/2011","07/04/2012","12/25/2012"))
What did you observe?: The results are actually mixed. On the 4th of July, there were more arrests in 2012 (132) than in 2011 (113); on Christmas, there were more arrests in 2011 (53) than in 2012 (43). Both years had many more arrests on the 4th of July than on Christmas.
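One caveat about the code above: length(table(...)) counts distinct arrest numbers, not rows, so the two can differ if an arrest ID ever repeats. A tiny sketch with hypothetical IDs:

```r
# Hypothetical arrest numbers for one day, with one ID repeated
ids <- c("A1", "A2", "A3", "A3")
length(ids)         # 4 rows
length(table(ids))  # 3 distinct IDs
```

If arrest numbers are unique in BPD_Arrests.csv the counts agree, but counting rows directly (e.g. with sum() on the date condition) is the safer habit.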
What are you asking:
We wanted to answer this question: what percentage of crime per sex is theft-related? In other words, if you are arrested as a male, what are the odds that your offense was theft-related? Does this proportion change if you are female? To answer this, we divide the number of theft-related crimes per sex by the total number of crimes per sex. For the purposes of this question we filter out unknown offenses and unknown sexes.
What is the code you used to answer it?
theft_regex <- regexpr("burg|stolen|robb|larceny", arrest_tab$incidentOffense, ignore.case=T, perl=T)
totalTab <- table(arrest_tab$incidentOffense[arrest_tab$sex != "" & arrest_tab$incidentOffense != "Unknown Offense"], arrest_tab$sex[arrest_tab$sex != "" & arrest_tab$incidentOffense != "Unknown Offense"])
robbTab <- table(arrest_tab$incidentOffense[theft_regex > -1], arrest_tab$sex[theft_regex > -1])
robbSum <- colSums(robbTab)
totalSum <- colSums(totalTab)
robbData <- data.frame(Female = robbSum["F"], Male = robbSum["M"])    # index by name in case an empty-sex column shifts positions
totalData <- data.frame(Female = totalSum["F"], Male = totalSum["M"])
theftFemaleProp <- (robbData$Female / totalData$Female) * 100
theftMaleProp <- (robbData$Male / totalData$Male) * 100
theftFemaleProp
## [1] 9.831437
theftMaleProp
## [1] 9.40193
What did you observe? We were surprised to see that females are more likely to be arrested for a theft-related crime. This is mainly due to larceny arrests.
larceny_regex <- regexpr("larceny", arrest_tab$incidentOffense, ignore.case=T, perl=T)
larcenyTab <- table(arrest_tab$incidentOffense[larceny_regex > -1], arrest_tab$sex[larceny_regex > -1])
larcenySum <- colSums(larcenyTab)
totalSum <- colSums(totalTab)
larcenyData <- data.frame(Female = larcenySum["F"], Male = larcenySum["M"])  # index by name, as above
totalData <- data.frame(Female = totalSum["F"], Male = totalSum["M"])
larcFemaleProp <- (larcenyData$Female / totalData$Female) * 100
larcMaleProp <- (larcenyData$Male / totalData$Male) * 100
larcFemaleProp
## [1] 6.639856
larcMaleProp
## [1] 3.971139
slices <- c(theftFemaleProp, theftMaleProp)
lbls <- c("Female", "Male")
pie(slices, labels = lbls, main="Theft")
slices <- c(larcFemaleProp, larcMaleProp)
lbls <- c("Female", "Male")
pie(slices, labels = lbls, main="Larceny")
What question are you asking?:
I am asking if there are differences in the distribution of arrests throughout the day depending on the age of the arrestee.
What is the code you use to answer it?:
#Bin the arrest data into age and time period chunks
ageGrid = seq(0,90,10)
ageFactor=cut(arrest_tab$age, breaks=ageGrid)
hourGrid = seq(0, 24, 3)
hourFactor=cut(arrest_tab$hours, breaks=hourGrid)
#Create a mosaic plot
mosaicplot(table(ageFactor, hourFactor), color=colorRampPalette(c('dark cyan','white','dark goldenrod'))(8), main="Number of Arrests by Age and Time of Day", xlab ="Age", ylab = "Time")
What did you observe?
The distribution of arrest times across age groups is actually quite similar. However, the youngest and oldest groups are less likely to be arrested at the very beginning and end of the day. Also, people 20 years old and younger are most active in the afternoon and evening rather than the morning. The drop in activity between 3am and 6am observed by Mihai is shared by all age groups.
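The "quite similar" impression could be tested formally with a chi-squared test of independence on the same table. A minimal sketch on a toy 2x2 age-by-time contingency table (the real input would be table(ageFactor, hourFactor)):

```r
# Hypothetical age-by-time-of-day counts (not the real data)
toy_tab <- matrix(c(500, 450, 480, 470), nrow = 2,
                  dimnames = list(age  = c("(10,20]", "(20,30]"),
                                  hour = c("(0,12]", "(12,24]")))
res <- chisq.test(toy_tab)
res$p.value  # large here, so no evidence against independence
```

A small p-value on the full table would mean the time-of-day profile does depend on age, which is what the mosaic plot is showing informally.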
What question are you asking?:
We were curious about how temperature relates to crime: does crime count correlate with temperature? We used average daily temperatures for Baltimore from Weather Underground and normalized the count for each temperature by the number of days that experienced that temperature. We expected more crime in warmer weather, since both criminals and victims are more likely to spend time outside when it is warm.
What is the code you use to answer it?:
# load weather data from csv file
baltimore_weather <- read.csv("baltimore_weather.csv", stringsAsFactors=FALSE)
# keep only the date and mean-temperature columns, and parse the date
tempAbr <- baltimore_weather[,c("EST","Mean.TemperatureF")]
tempAbr$EST <- as.Date(baltimore_weather$EST,"%Y-%m-%d")
# copy arrest data and add formatting for date
Date_Arrests <- arrest_tab
Date_Arrests$arrestDate <- as.Date(Date_Arrests$arrestDate,"%m/%d/%Y")
# join weather and crime tables to match days together
tempMerge<-merge(x=Date_Arrests,y=tempAbr, by.x ="arrestDate", by.y="EST" )
tableCrime <- table(tempMerge$Mean.TemperatureF)
tableTemp <- table(baltimore_weather$Mean.TemperatureF)
# plot crime count versus temperature, normalized by day count for each temperature
plot(tableCrime/tableTemp, xlab="Temperature (F)", ylab="Normalized Crime Count")
divTable = tableCrime/tableTemp
divtmatrix = as.data.frame(divTable)
# cast factors as numeric values for correlation
divtmatrix[,1] <- as.numeric(as.character(divtmatrix[,1]))
divtmatrix[,2] <- as.numeric(divtmatrix[,2])
cor(divtmatrix[,1],divtmatrix[,2])
## [1] 0.3673159
What did you observe?:
While we suspected that crime count would correlate with temperature, the correlation coefficient was only 0.367, suggesting that crime does not depend linearly on temperature. There also appears to be no clear non-linear dependence in our data. Compared to this analysis of Chicago (http://crime.static-eric.com/), temperature relates to crime in Baltimore to a much smaller degree.
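One way to look past a weak Pearson coefficient is Spearman's rank correlation, which picks up any monotone (not just linear) trend. A sketch on hypothetical temperature/count pairs (made-up numbers, not the Baltimore data):

```r
# Hypothetical temperatures and crime counts with a convex, increasing trend
temp  <- c(20, 30, 40, 50, 60, 70, 80, 90)
crime <- c(100, 105, 112, 125, 150, 200, 290, 420)
cor(temp, crime)                       # Pearson understates the curved trend
cor(temp, crime, method = "spearman")  # 1 for any strictly increasing trend
```

If Spearman on the real divtmatrix were also near 0.37, that would support the conclusion that there is no strong monotone relationship either.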